Unsupervised Learning Project

Alok Sawant

Importing the Libraries

Function to calculate accuracy

Part-1

Automobile

Importing the data (Task-1)

Import part1 of dataset

The given data set contains 398 rows and 1 column

Import part 2 of data set

The given data set contains 398 rows and 8 columns

Merging the two parts of the data set
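The two parts can be combined column-wise. A minimal sketch with pandas, assuming both parts share the same row index (the column names here are illustrative stand-ins for the actual attributes):

```python
import pandas as pd

# Illustrative fragments of the two parts; in the project these are read from files
part1 = pd.DataFrame({"mpg": [18.0, 15.0, 16.0]})
part2 = pd.DataFrame({"cylinders": [8, 8, 8], "hp": ["130", "165", "?"]})

# axis=1 joins on the row index, giving one row per car with all columns
df = pd.concat([part1, part2], axis=1)
print(df.shape)
```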

Data Cleansing (Task -2)

In the given dataset the 'hp' column has dtype object, most likely because it contains non-numerical values
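One common way to clean such a column is to coerce it to numeric, turning placeholder strings (here assumed to be `?`) into NaN, and then fill those with the median:

```python
import pandas as pd

# Sketch: '?' is an assumed placeholder for missing horsepower values
df = pd.DataFrame({"hp": ["130", "165", "?"]})
df["hp"] = pd.to_numeric(df["hp"], errors="coerce")  # non-numeric strings -> NaN
df["hp"] = df["hp"].fillna(df["hp"].median())        # impute with the median
```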

Data analysis and visualisation (Task - 3)

1) MPG vs origin

In the plot above, the red line shows the global average mpg. The majority of cars belonging to origin 1 have an average mpg below the global average, while the majority of cars belonging to origins 2 and 3 have an average mpg above it.

2) MPG vs cylinder

The above plot shows that the majority of cars with 4 or 5 cylinders have an mpg above the global average.

3) MPG vs year

The plot shows that the majority of cars before year 79 have an average mpg below the global average, except for year 74 (where the average mpg is slightly above the global average).

4) MPG vs Company names

In the given dataset the car_name attribute contains the company name, model name and variant. These can be separated for further analysis.
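A minimal sketch of the separation, assuming the company name is the first whitespace-separated token of `car_name`:

```python
import pandas as pd

# Two illustrative car names from the mpg dataset
df = pd.DataFrame({"car_name": ["chevrolet chevelle malibu", "buick skylark 320"]})

# Take the first token as the company name
df["company"] = df["car_name"].str.split().str[0]
```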

The above plot shows the mpg distribution for the various car manufacturers.

Pair plot for the dataset

Machine learning (Task - 4)

Scaling the data

K means clustering

The location of a knee in the plot is usually taken as an indicator of the appropriate number of clusters, because it means that adding another cluster does not improve the partition much. The above plot shows a clear knee in the elbow curve at k=2.
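The elbow computation can be sketched as follows, using synthetic two-blob data as a stand-in for the scaled automobile features; the drop in inertia from k=1 to k=2 dominates, producing the knee at k=2:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two well-separated groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
X = StandardScaler().fit_transform(X)

# Inertia (within-cluster sum of squares) for each candidate k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```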

Hierarchical clustering

The K-means method indicated two optimal clusters in the given data set. Likewise, with the hierarchical clustering method the dendrogram above shows that the data can be divided into 2 optimal clusters.
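Cutting the dendrogram into two clusters can be sketched with SciPy's Ward linkage on synthetic two-group data (a stand-in for the scaled features):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two synthetic groups; Ward linkage mirrors the dendrogram analysis
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(8, 1, (30, 2))])
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```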

Splitting data into training and testing set

Linear regression model

For cluster 1

For Cluster 2

When the whole dataset is used as a single cluster, the linear regression model showed an accuracy of 82%. When the dataset is split into two clusters, the linear regression model showed an accuracy of 67% for cluster 1 and 71% for cluster 2. The decrease in accuracy may be due to fewer data points being available for training and testing when two separate clusters are used. To improve the prediction accuracy further, more data points are needed for the linear regression model in each cluster.
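The split-and-score step can be sketched as follows; synthetic data stands in for one cluster's features and target, and the reported "accuracy" corresponds to the model's R² score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for one cluster's features and mpg
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R^2 on the held-out test set
```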

Part-2

There are 18 missing values in the 'Quality' attribute of the dataframe

Imputing the missing value for the attribute 'Quality'
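A minimal sketch of one common imputation strategy for a categorical attribute, filling missing entries with the most frequent value (the mode); the values 'good'/'bad' are illustrative stand-ins:

```python
import pandas as pd

# Illustrative Quality column with two missing entries
df = pd.DataFrame({"Quality": ["good", "bad", None, "good", None]})

# Fill missing values with the most frequent category
df["Quality"] = df["Quality"].fillna(df["Quality"].mode()[0])
```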

Part-3

The dataframe consists of 846 rows and 19 columns

Most of the data in the given dataframe is numerical, except for the 'Class' attribute, which shows the category of the vehicle

The majority of the target class consists of cars, followed by buses and vans. Almost 50% of the dataset consists of cars

1) In the given dataset the 'compactness' and 'circularity' attributes seem to be normally distributed.

2) Most of the attributes in the dataset are positively skewed.

1) The mean 'compactness', mean 'circularity' and mean 'distance circularity' are highest for cars.

2) The compactness is lowest for vans, and the positively skewed data for buses indicates that a few buses have higher compactness.

3) The distributions of most of the other attributes are similar for cars, vans and buses.

4) There are outliers in the 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1', 'radius_ratio' and 'max_length_aspect_ratio' columns.
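Outliers like these are typically flagged with the 1.5×IQR rule; a minimal sketch on a made-up series:

```python
import pandas as pd

# Illustrative series with one obvious outlier (95)
s = pd.Series([10, 12, 11, 13, 12, 11, 95])
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5*IQR from the quartiles are flagged as outliers
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```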

Classifier (Task - 2)

The model shows 95% accuracy using all the attributes given in the dataset
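The classification pipeline can be sketched as follows; the Iris dataset stands in for the vehicle silhouette data, and the scale-then-SVM pipeline mirrors the approach described here:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Iris is a stand-in for the vehicle data (features X, class labels y)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scale the features, then fit an RBF-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # held-out accuracy
```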

Dimensionality reduction (Task - 3)

Applying principal component analysis for dimensionality reduction

Eigen values

Eigen Vectors

The percentage of variation explained by each eigenvector
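The eigenvalues and their variance shares can be computed directly from the covariance matrix; a sketch on random stand-in data:

```python
import numpy as np

# Eigenvalues of the covariance matrix give per-component variance;
# normalizing them gives the percentage of variation explained.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
explained = eigvals[::-1] / eigvals.sum()  # sorted largest first
```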

In the above plots, a drop in variance is observed as the number of principal components increases. Using the above plot, 8 principal components were selected, as these 8 components explain 95% of the variance in the data.
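Picking the smallest number of components that reaches the 95% threshold can be sketched as follows; the synthetic data here has 5 informative columns each duplicated with small noise, so about half the components capture nearly all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 5 base columns plus 5 noisy near-duplicates -> 10 features, ~5 real directions
rng = np.random.default_rng(3)
base = rng.normal(size=(200, 5))
X = np.hstack([base, base * 0.95 + rng.normal(scale=0.1, size=(200, 5))])
Xs = StandardScaler().fit_transform(X)

# Cumulative explained variance; keep the smallest count reaching 95%
cum = np.cumsum(PCA().fit(Xs).explained_variance_ratio_)
n_components = int(np.argmax(cum >= 0.95) + 1)
```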

SVM classifier (Task -5)

The model shows 94% accuracy while using 8 principal components

To calculate the classification results, an SVM model has been implemented. PCA has been applied for feature extraction, reducing the dimensionality from 18 to 8. By reducing the dimension of the feature space, there are fewer relationships between variables to consider and the model is less likely to overfit; as an added benefit, the "new" variables after PCA are all independent of one another, as seen in the pair plot. The accuracy of the results predicted by the SVM classifier using all 18 attributes is 95%. However, the accuracy of the SVM classifier using the PCA dimensionality reduction technique with 8 components is around 94%.

Part-4

In the given data set every alternate row contains null values.
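Such fully-null rows can be dropped with `dropna(how="all")`; a sketch on made-up player data:

```python
import numpy as np
import pandas as pd

# Illustrative frame where every alternate row is entirely null
df = pd.DataFrame({"runs": [100, np.nan, 250, np.nan],
                   "average": [50.0, np.nan, 41.6, np.nan]})

# Drop rows where every column is null, then reset the index
df = df.dropna(how="all").reset_index(drop=True)
```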

The data set contains 6 different attributes; hence, to rank the players, a dimensionality reduction technique can be used to select the most relevant features for ranking.

Eigen values

Eigen vectors

The percentage of variation explained by each eigenvector

From the above plot it can be seen that 90% of the variance in the data can be explained by the 'runs', 'average' and 'strike rate' attributes. Hence, these attributes are used for ranking the players.

Ranking the players based on runs scored

Ranking the players based on average

Ranking the players based on strike rate
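Each per-attribute ranking can be sketched with `rank` and `sort_values`; the player names and strike rates below are made up for illustration:

```python
import pandas as pd

# Hypothetical players and strike rates
players = pd.DataFrame({"player": ["A", "B", "C"],
                        "strike_rate": [85.0, 140.0, 120.0]})

# Rank 1 = highest strike rate; then order the table by rank
players["rank"] = players["strike_rate"].rank(ascending=False).astype(int)
ranked = players.sort_values("rank")
```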

Part-5

The most commonly implemented dimensionality reduction techniques in Python are:

1) Missing Value Ratio

2) Low Variance Filter

3) High Correlation Filter

4) Random Forest

5) Backward Feature Elimination

6) Forward Feature Selection

7) Factor Analysis

8) Principal Component Analysis

9) Independent Component Analysis

10) Methods Based on Projections

11) t-Distributed Stochastic Neighbor Embedding (t-SNE)

12) UMAP